Aggregate Functions

In Spark, you can perform aggregate operations on a DataFrame. These are similar to SQL aggregate functions such as MAX, MIN, and SUM. We can also aggregate over specific columns, which is equivalent to the GROUP BY clause in typical SQL. Let's look at some examples.

val empDF = spark.createDataFrame(Seq(
      (7369, "SMITH", "CLERK", 7902, "17-Dec-80", 800, 20, 10),
      (7499, "ALLEN", "SALESMAN", 7698, "20-Feb-81", 1600, 300, 30),
      (7521, "WARD", "SALESMAN", 7698, "22-Feb-81", 1250, 500, 30),
      (7566, "JONES", "MANAGER", 7839, "2-Apr-81", 2975, 0, 20),
      (7654, "MARTIN", "SALESMAN", 7698, "28-Sep-81", 1250, 1400, 30),
      (7698, "BLAKE", "MANAGER", 7839, "1-May-81", 2850, 0, 30),
      (7782, "CLARK", "MANAGER", 7839, "9-Jun-81", 2450, 0, 10),
      (7788, "SCOTT", "ANALYST", 7566, "19-Apr-87", 3000, 0, 20),
      (7839, "KING", "PRESIDENT", 0, "17-Nov-81", 5000, 0, 10),
      (7844, "TURNER", "SALESMAN", 7698, "8-Sep-81", 1500, 0, 30),
      (7876, "ADAMS", "CLERK", 7788, "23-May-87", 1100, 0, 20)
    )).toDF("empno", "ename", "job", "mgr", "hiredate", "sal", "comm", "deptno")
 
Create a temp view from the DataFrame
empDF.createOrReplaceTempView("emp")
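With the view registered, the same aggregations can also be written as plain SQL (a sketch, assuming a spark-shell session where spark is the active SparkSession):

```scala
// Query the registered temp view with standard SQL
spark.sql("SELECT MIN(sal), MAX(sal), AVG(sal), COUNT(deptno) FROM emp").show()
```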

First method: to calculate min, max, average, and count, we can use select with aggregate functions:
import org.apache.spark.sql.functions._
empDF.select(min($"sal"),max($"sal"),avg($"sal"),count($"deptno")).show()
 

 
The second method is agg. To calculate count, min, max, and sum, we can use the syntax below:
import org.apache.spark.sql.functions._
empDF.agg(count($"deptno"),min($"sal"),max($"sal"),sum($"sal")).show()
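The columns produced by agg get generated names like max(sal). If you prefer readable names, the standard .as alias can be applied to each aggregate (a sketch of the same aggregation, with column names of my choosing):

```scala
import org.apache.spark.sql.functions._

// Same aggregation, with explicit column aliases
empDF.agg(
  count($"deptno").as("emp_count"),
  min($"sal").as("min_sal"),
  max($"sal").as("max_sal"),
  sum($"sal").as("total_sal")
).show()
```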

 
Find the number of records in a DataFrame
The count function returns the number of records in the DataFrame.
empDF.count

Find the sum of Salary by Department
import org.apache.spark.sql.functions._
empDF.groupBy($"deptno").sum("sal").show

 
import org.apache.spark.sql.functions._
empDF.groupBy($"deptno").agg(sum($"sal")).show
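The agg form is more flexible than calling sum directly, because it allows several aggregates per group in one pass. A sketch combining sum and average per department, sorted by department number for stable output:

```scala
import org.apache.spark.sql.functions._

// Multiple aggregates per department in a single groupBy
empDF.groupBy($"deptno")
  .agg(sum($"sal").as("total_sal"), avg($"sal").as("avg_sal"))
  .orderBy($"deptno")
  .show()
```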

Find the maximum Salary by Department
import org.apache.spark.sql.functions._
empDF.groupBy($"deptno").max("sal").show

 
Find the maximum Salary and record count by Department
import org.apache.spark.sql.functions._
empDF.groupBy($"deptno").agg(Map("deptno"->"count","sal"->"max")).show
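The same grouped aggregation can be expressed in SQL against the temp view registered earlier, which is often easier to read when the column-to-function Map grows:

```scala
// Equivalent GROUP BY query via the "emp" temp view
spark.sql("""
  SELECT deptno, COUNT(deptno), MAX(sal)
  FROM emp
  GROUP BY deptno
""").show()
```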
